Graphical EDA I: continous data

Team 1

09 November 2020

Structure

1. Examining Continuous Variables

1.1. Introduction

1.2. What features might continuous variable have?

1.3. Looking for features

1.4. Comparing distributions by subgroups

1.5. What plots are there for individual continuous variables?

1.6. Plot options

1.7. Modelling and testing for continuous variables

1.8. Main points of chapter

2. Looking for Structure: Dependency Relationships and Associations

2.1. Introduction

2.2. Features you can see with scatterplots

2.3. Looking at pairs of continuous variables

2.4. Adding models: lines and smooths

2.5. Comparing groups within scatterplots

2.6. Scatterplot matrices for looking at many pairs of variables

2.7. Scatterplot options

2.8. Modelling and testing for relationships between variables

2.9. Main points

3. Investigating Multivariate Continuous Data

3.1. Introduction

3.2. What is a parallel coordinate plot (pcp)?

3.3. Features you can see with parallel coordinate plots

3.4. Parallel coordinate plots and time series

3.5. Parallel coordinate plots for indices

3.6. Options for parallel coordinate plots

3.7. Main points

1. Examining Continuous Variables

1.1. Introduction

A continuous variable in theory can take any value over its range. A handy exemple of countinuous variable can be:

Different plotting methods are used to display the data distribution for continuous variables.

The purpose of those methods is to highlight important features of data. Some plots emphasise one feature over another, some are very specialised, some require more deciphering than others.

Two possible approaches are to use a range of different plot types or to use a variety of different plots of the same type.

The plots types we are going to use in this chapter are histograms and boxplots.

1.2. What features might continuous variable have?

Let’s see a list of features for continuous variables:

Graphics are good for displaying the features**

They can provide more and different kinds of information than a set of summary statistics.

1.3. Looking for features

Next we are going to analyse a data set using histograms in order to see which features might be present and how we can find them.

The Galton’s Height Data is a famous 1885 study of Francis Galton exploring the relationship between the heights of adult children and the heights of their parents.

The galton dataset from UsingR package includes data on heights for 928 children and 205 ‘midparents’. Using the code below we are creating Children histogram and Parents histogram.

data(galton, package="UsingR")
ht <- "height (in)"
par(mfrow=c(1,2), las=1, mar=c(3.1, 4.1, 1.1, 2.1))
with(galton, {
hist(child, xlab=ht, main="Children", col="green")
hist(parent, xlab=ht, main="Parents", col="blue")})

#### First things first, we can see that: - the distributions are roughly symmetric and there is more spread amongst the children; - there appear to be no outliers; - although the histograms appear to have different binwidths they are actually the same;

To look for gaps or heaping we create next histograms:

par(mfrow=c(1,2), las=1, mar=c(3.1, 4.1, 1.1, 2.1))
with(galton, {
MASS::truehist(child, h=0.1)
MASS::truehist(parent, h=0.1)})

In both histograms there appear to be narrower gaps between values at the boundaries. Clearly the data were aggregated or reported to a limited level of precision because we can see how few height values actually occur.

Next figure displays the histograms one above the other with equal scales and binwidths. The x-axis scale limits were chosen to include the full range of the data and the y-axis limits were chosen by inspection.

library(ggplot2)
library(gridExtra)
c1 <- ggplot(galton, aes(child)) + geom_bar( binwidth=1) +
xlim(60, 75) + ylim(0, 225) + ylab("") +
geom_vline(xintercept=median(galton$child),
col="red")
p1 <- ggplot(galton, aes(parent)) + geom_bar( binwidth=1) +
xlim(60, 75) + ylim(0, 225) + ylab("") +
geom_vline(xintercept=median(galton$parent),
col="red")
grid.arrange(c1, p1)

We can see that Parents’ heights are spread less and their median is slightly higher. Although the means are very similar, the parents’ distribution is clearly less variable. This is because each parent value is an average of two values.

1.4. Comparing distributions by subgroups

A population can be made up of several groups. To get more informations, we can compare the distributions of variable across the groups.

In order to approach this problem, we can use Boxplots because they make such an efficient use of the space available.

Boxplots:

The boxplot() function takes in any number of numeric vectors, drawing a boxplot for each vector. We will use airquality dataset in order to see the air quality for each month; In the example, Temp can be our numeric vector. Month can be our grouping variable, so that we get the boxplot for each month separately. In our dataset, month is in the form of number 5=May, 6=June and so on).

boxplot(Temp~Month,
data=airquality,
main="Different boxplots for each month",
xlab="Month Number",
ylab="Degree Fahrenheit",
col="orange",
border="brown"
)

It is clear from the above figure that the July month (7) is relatively hotter than the rest. Although the boxplots convey a lot of information in a limited space, they have two disadvantages: If a distribution is not unimodal, you can’t see that, and it is difficult to convey information on how big the different groups are.

1.5. What plots are there for individual continuous variables?

To display continuous data graphically we can use:

1.6. Plot options

What are the options for plottings?

1.7. Modelling and testing for continuous variables

Let’s see the way of modelling and testing for continuous variable:

1.8. Main points of chapter

2. Looking for Structure: Dependency Relationships and Associations

2.1. Introduction

Roles of scatterplots are:

Over ten thousand athletes competed in the London Summer Olympics of 2012. The figure below shows a scatterplot of Weight against Height from the dataset oly12 in the package VGAMdata.

library(ggplot2);
data(oly12, package="VGAMdata")
ggplot(oly12, aes(Height, Weight)) + geom_point() +
ggtitle("Athletes at the London Olympics 2012")

2.2. Features you can see with scatterplots



Features that might be found in a scatterplot include:

2.3. Looking at pairs of continuous variables

Drinks Wages

A scatterplot of average wages against proportion of drinkers for all 70 trades in Pearson’s DrinksWages dataset in the HistData package shows that there are a surprising number of trades where either all workers are drinkers or none are.

data(DrinksWages, package = "HistData")
ggplot(DrinksWages, aes(drinks / n, wage)) + geom_point() +
  xlab("Proportion of drinkers") + xlim(0, 1) + ylim(0, 40)

Observations

2.3. Looking at pairs of continuous variables (cont.)

Old faithful geyser

The dataset geyser in MASS provides 299 observations of the duration of eruptions and the time to the next eruption for the Old Faithful geyser in Yellowstone Park.

library(ggplot2);
data(geyser, package="MASS");
ggplot(geyser, aes(duration, waiting)) + geom_point();

To assess the possibility of clustering, consider a density estimate. The figure below displays contours of a bivariate density estimate supporting the idea of there being three clusters.

ggplot(geyser, aes(duration, waiting)) + geom_point() +
geom_density2d()

2.4. Adding models: lines and smooths

Pearson heights

Adding lines or smooths to scatterplots is easy and often provides valuable guidance, as can be seen comparing the figures below, where the relationship between sons’ and fathers’ heights, from dataset father.son, is difficult to assess from the scatterplot alone.

Scatterplot with best regression line and y=x diagonal

data(father.son, package="UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() +
geom_smooth(method="lm", colour="red") +
geom_abline(slope=1, intercept=0)
## `geom_smooth()` using formula 'y ~ x'

Plotting a smoother and the best regression line together

data(father.son, package="UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() +
geom_smooth(method="lm", colour="red", se=FALSE) +
stat_smooth()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

2.5. Comparing groups within scatterplots

The Olympic athletes’ dataset shown in Chapter 2.1 suffered from overplotting, which can be overcome by splitting up the data by possible explanatory variables. In this case, we can separate the plot by sex: a scatterplot for females and one for males.

ggplot(oly12, aes(Height, Weight)) +
geom_point(size = 1) + facet_wrap(~Sex, ncol=1)

Plotting all the 42 sports from the dataset together we get the following figure:

oly12S <- within(oly12, Sport <- abbreviate(Sport, 12))
ggplot(oly12S, aes(Height, Weight)) +
  geom_point(size = 1) + facet_wrap( ~ Sport) +
  ggtitle("Weight and Height by Sport")

2.6. Scatterplot matrices for looking at many pairs of variables

Scatterplot matrices (sploms) are tables of scatterplots with each variable plotted against all of the others. They give excellent initial overviews of the relationships between continuous variables in datasets with small numbers of variables.

Crime in the U.S.

The dataset crime.us from the VGAMdata package includes the absolute crime figures and the crime rates by population for the fifty U.S. states in 2009.

library(GGally);
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
data(crime.us, package="VGAMdata")
crime.usR <- crime.us
names(crime.usR) <- gsub("*Rate", "", names(crime.usR))
names(crime.usR)[19:20] <- c("Larceny", "MotorVTheft");
ggpairs(crime.usR[, c(13:16, 18:20)],
title="Crime rates in the USA",
diag=list(continuous='densityDiag'), axisLabels='none')

The figure above shows a splom of the rates for seven kinds of crime. The dataset also includes rates for the four crimes of violence together and density estimates for the variables down the diagonal.


Functions for drawing sploms

function package
plot
pairs
spm car
cpairs gclus
splom lattice
ggpairs GGally
gpairs gpairs
pairs.mod SMPracticals
pairscor.fnc languageR

Note: Sploms are ineffective for large number of variables (a splom for m variables includes m(m-1) scatterplots and m diagonal elements).

2.7. Scatterplot options

Point size

Symbols for points

Alpha-blending

Colouring points

2.8. Modelling and testing for relationships between variables



Correlation


Regression


Smoothing


Bivariate density estimation


Outliers

2.9. Main points




3. Investigating Multivariate Continuous Data

3.1. Introduction

Traditional methods used for displaying multivariate continous data:

New Method: parallel coordinate plots.

3.2. What is a parallel coordinate plot (pcp)?

library('GGally'); data(food, package="MMST")
names(food) <- c("Fat", "Food.energy", "Carbohyd", "Protein", "Cholest", "Wt", "Satur.Fat")
food1 <- food/food$Wt
ggparcoord(data = food1, columns=c(1:5, 7), scale="uniminmax", alphaLines=0.2) + 
  xlab("") + ylab("")

In a parallel coordinate plot:


Functions for drawing pcp’s in R

function package
ggparcoord GGally
parcoord MASS
cparcoord gclus
pcp PairViz
parallelplot lattice
parallel.ade epade
ipcp iplots

3.3. Features you can see with parallel coordinate plots

Parallel coordinate plots are very useful for checking results like:


Illustrating Fisher’s iris dataset

ggparcoord(iris, columns=1:4, groupColumn="Species")

3.4. Parallel coordinate plots and time series

In a parallel coordinate plot used for representing time series:

Note: ipcp method of iplots package is better for creating interactive time series pcp’s.


Illustrating the acres of corn planted by US states from 1866 to 2011

library(reshape2); data(nass.corn, package="agridat")
c1 <- melt(nass.corn, id=c("year", "state"))
c1 <- within(c1, StateV <- interaction(state, variable))
c2 <- dcast(c1, StateV~year)
ggparcoord(subset(c2[1:48,], c2[1:48,147]> 250000), columns=2:147, groupColumn="StateV", scale="globalminmax") + 
  xlab("Year") + ylab("Acres") + 
  scale_x_discrete(breaks=seq(1865, 2015, 10)) + 
  theme(legend.position = "none")

3.5. Parallel coordinate plots for indices

Usually, index values are weighted combinations of the values of their components. Examples:

One way to represent them in pcp’s:


Illustrating the Russell group from the uniranks dataset

data(uniranks, package="GDAdata")
names(uniranks)[c(5, 6, 8, 10, 11, 13)] <- c("AvTeach", "NSSTeach", "SpendperSt", "Careers", "VAddScore", "NSSFeedb")
uniranks1 <- within(uniranks, StaffStu <- 1/(StudentStaffRatio))
uniranks2 <- within(uniranks1,
                    Rus <- ifelse(UniGroup=="Russell", "Russell", "not"))
ggparcoord(uniranks2[order(uniranks2$Rus, decreasing=FALSE),],
           columns=c(5:8, 10:14),
           order=c(5,12,8,9,14,6,13,7,11,10),
           groupColumn="Rus", scale="uniminmax") +
           xlab("") + ylab("") +
           theme(legend.position = "none",
           axis.ticks.y = element_blank(),
           axis.text.y = element_blank()) +
           scale_colour_manual(values = c("gray","red")) + 
           geom_line(size = 0.8)

3.6. Options for parallel coordinate plots

3.6.1. Scaling

Depending on the situation, there are different types of scaling:
library('gridExtra'); data(body, package="gclus")
body1 <- body
names(body1) <- abbreviate(names(body), 2)
names(body1)[c(4:5, 11:13, 19:21)] <- c("CDp", "CD", "Ch", "Ws", "Ab", "Cl", "An", "Wr")
a1 <- ggparcoord(body1, columns=1:24, alphaLines=0.1) + xlab("") + ylab("")
a2 <- ggparcoord(body1, columns=1:24, scale="uniminmax", alphaLines=0.1) + xlab("") + ylab("")
a3 <- ggparcoord(body1, columns=1:24, scale="globalminmax", alphaLines=0.1) + xlab("") + ylab("")
a4 <- ggparcoord(body1, columns=1:24, scale="center", scaleSummary="median", alphaLines=0.1) + xlab("") + ylab("")
grid.arrange(a1, a2, a3, a4)

3.6.2. Outliers

Sometimes we want to redraw the plot without showing the outliers. There are 3 ways:
Remove the cases with outliers (food dataset)
fc <- function(xv) {
  bu <- boxplot(xv, plot=FALSE)$stats[5]
  cxv <- ifelse(xv > bu, NA, xv)
  bl <- boxplot(xv, plot=FALSE)$stats[1]
  cxv <- ifelse(cxv < bl, NA, cxv)
}
data(food, package="MMST")
rxfood <- as.data.frame(apply(food,2,fc))
ggparcoord(data = rxfood, columns = c(1:7), 
           scale="uniminmax", missing="exclude", 
           alphaLines=0.3) + xlab("") + ylab("")

Trim all outliers to the chosen limits (food dataset)
fb <- function(xv) {
  bu <- boxplot(xv, plot=FALSE)$stats[5]
  rxv <- ifelse(xv > bu, bu, xv)
  bl <- boxplot(xv, plot=FALSE)$stats[1]
  rxv <- ifelse(rxv < bl, bl, rxv)
}
data(food, package="MMST")
rfood <- as.data.frame(apply(food,2,fb))
ggparcoord(data = rfood, columns = c(1:7), 
           scale="uniminmax", alphaLines=0.3)

Restrict the plot to the chosen limits (food dataset)
fd <- function(xv) {
  bu <- boxplot(xv, plot=FALSE)$stats[5]
  bl <- boxplot(xv, plot=FALSE)$stats[1]
  dxv <- (xv - bl)/(bu - bl)
}
data(food, package="MMST")
rofood <- as.data.frame(apply(food,2,fd))
ggparcoord(data = rofood, columns = c(1:7)) + 
           coord_cartesian(ylim=c(0,1))

3.6.3. Formatting

Type of display
Missings
Aspect ratio
Orientation
Lines
Colour
data(Boston, package="MASS")
Boston1 <- within(Boston,
                  hmedv <- factor(ifelse(medv == 50,"Top", "Rest")))
Boston1 <- within(Boston1, mlevel <- ifelse(medv==50,1,0.1))
Boston1 <- within(Boston1, medv1 <- medv)
a <- ggparcoord(data = Boston1[order(Boston1$hmedv),],
                columns=c(1:14), groupColumn="hmedv",
                scale="uniminmax", alphaLines="mlevel",
                mapping = aes(size = 1)) + xlab("") + ylab("") +
                theme(axis.ticks.y = element_blank(),
                axis.text.y = element_blank())
b <- ggparcoord(data = Boston1, columns=c(1:14),
                groupColumn="medv1", scale="uniminmax") +
                xlab("") + ylab("") +
                theme(axis.ticks.y = element_blank(),
                axis.text.y = element_blank())
grid.arrange(a,b)

Alpha-blending

The option alphaLines in ggparcoord can be used:

library('dplyr')
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Boston1 <- Boston1 %>% mutate(
           arad = factor(ifelse(rad < max(rad), 0, 1)),
           aLevel = ifelse(rad < max(rad), 0.1, 1))
ggparcoord(data = Boston1, columns=c(1:14),
           scale="uniminmax", groupColumn= "arad",
           alphaLines="aLevel", order="allClass") +
           xlab("") + ylab("") +
           theme(legend.position = "none",
           axis.ticks.y = element_blank(),
           axis.text.y = element_blank())

3.7. Main points

THANK YOU!